# Lightweight Multimodal
HyperCLOVAX-SEED-Vision-Instruct-3B
Other
HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight multimodal model developed by NAVER, featuring image-text understanding and text generation capabilities, with special optimization for Korean language processing.
Image-to-Text
Transformers

naver-hyperclovax · Downloads: 160.75k · Likes: 170
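A minimal loading sketch, assuming the standard transformers remote-code pattern; the vision preprocessing and chat API for HyperCLOVAX are supplied by the repo's custom code, so consult the model card before use:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code loads the repo's custom multimodal implementation.
# Whether AutoProcessor covers the vision inputs is an assumption here.
repo = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
```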
Barcenas-4b
A multimodal model fine-tuned from google/gemma-3-4b-it, specializing in high-quality data for mathematics, programming, science, and puzzle-solving.
Image-to-Text
Transformers English

Danielbrdz · Downloads: 15 · Likes: 2
Heron-NVILA-Lite-1B
Apache-2.0
A Japanese vision-language model built on the NVILA-Lite architecture, supporting image-text interaction in both Japanese and English.
Image-to-Text
Safetensors Supports Multiple Languages
turing-motors · Downloads: 460 · Likes: 2
SmolVLM2-256M-Video-Instruct-mlx
Apache-2.0
A video-text-to-text model converted to the MLX framework, suited to video understanding and instruction-following tasks.
Image-to-Text
Transformers English

mlx-community · Downloads: 591 · Likes: 7
SmolVLM2-500M-Video-Instruct
Apache-2.0
A lightweight multimodal model designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
Image-to-Text
Transformers English

HuggingFaceTB · Downloads: 17.89k · Likes: 56
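A minimal usage sketch following the pattern shown on the SmolVLM2 model cards; it assumes a recent transformers release with SmolVLM2 support, and `demo.mp4` is a hypothetical local file:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)

# One chat turn mixing a video and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "demo.mp4"},  # hypothetical local file
            {"type": "text", "text": "Describe this video in one sentence."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

The same call pattern should apply to the 256M checkpoint below, swapping only the model path.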
SmolVLM2-256M-Video-Instruct
Apache-2.0
SmolVLM2-256M-Video-Instruct is a lightweight multimodal model specifically designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
Image-to-Text
Transformers English

HuggingFaceTB · Downloads: 22.16k · Likes: 53
T-lite-it-1.0 Quants GGUF
T-lite-it-1.0 is a large language model supporting Russian and English, distributed here as GGUF-format quantizations.
Large Language Model Supports Multiple Languages
DefaultDF · Downloads: 49 · Likes: 0
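A minimal sketch for running one of these quants locally with llama-cpp-python; the GGUF filename below is hypothetical, so substitute an actual file from the repo:

```python
from llama_cpp import Llama

# Load a local GGUF quant (hypothetical filename; download one from the repo).
llm = Llama(model_path="t-lite-it-1.0-Q4_K_M.gguf", n_ctx=4096)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```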
nanoLLaVA-1.5
Apache-2.0
nanoLLaVA-1.5 is a compact yet capable vision-language model with under 1 billion parameters, designed specifically for edge devices.
Image-to-Text
Transformers English

qnguyen3 · Downloads: 442 · Likes: 109
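A hedged loading sketch: nanoLLaVA ships its vision and chat logic as remote code, so the image-preprocessing helpers come from the repo itself, and the full ChatML prompt format (with an `<image>` token) is documented on the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the repo's custom vision/chat implementation.
repo = "qnguyen3/nanoLLaVA-1.5"
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```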
Imp-v1.5-4B-Phi3
Apache-2.0
Imp-v1.5-4B-Phi3 is a lightweight, high-performance multimodal model with only 4 billion parameters, built on the Phi-3 language model and a SigLIP visual encoder.
Image-to-Text
Transformers

MILVLG · Downloads: 140 · Likes: 7
moondream2 llamafile
Apache-2.0
moondream2 is a compact vision-language model specifically designed for efficient operation on edge devices, offering convenient deployment through the llamafile format.
Image-to-Text
cjpais · Downloads: 310 · Likes: 30
nanoLLaVA
Apache-2.0
nanoLLaVA is a vision-language model of roughly 1 billion parameters, designed to run efficiently on edge devices.
Image-to-Text
Transformers English

qnguyen3 · Downloads: 2,851 · Likes: 154
MiniCPM-V
MiniCPM-V is an efficient lightweight multimodal model optimized for edge device deployment, supporting bilingual (Chinese-English) interaction and outperforming models of similar scale.
Image-to-Text
Transformers

openbmb · Downloads: 19.74k · Likes: 173
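A sketch of the remote-code `chat()` interface shown on the MiniCPM-V model card; the exact signature is defined by the repo's custom code, so treat the argument list as an assumption:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

repo = "openbmb/MiniCPM-V"
model = AutoModel.from_pretrained(repo, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")  # hypothetical local image
msgs = [{"role": "user", "content": "What is in this image?"}]

# chat() is a remote-code method; arguments follow the model card's example.
answer, context, _ = model.chat(
    image=image, msgs=msgs, context=None, tokenizer=tokenizer, sampling=True
)
print(answer)
```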
moondream1
A 1.6B-parameter multimodal model combining the SigLIP and Phi-1.5 architectures, supporting image understanding and Q&A tasks.
Image-to-Text
Transformers English

vikhyatk · Downloads: 70.48k · Likes: 487
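A sketch of the `encode_image`/`answer_question` remote-code API used across the moondream model cards; details can differ by revision, so treat this as an assumption:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "vikhyatk/moondream1"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("photo.jpg")  # hypothetical local image
enc = model.encode_image(image)  # remote-code helper from the repo
print(model.answer_question(enc, "What is happening in this picture?", tokenizer))
```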
tiny-llava-v1-hf
Apache-2.0
TinyLLaVA is a framework of small-scale large multimodal models focused on vision-language tasks, delivering strong performance from a small parameter count.
Image-to-Text
Transformers Supports Multiple Languages

bczhou · Downloads: 2,372 · Likes: 57
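Because this "-hf" checkpoint follows the transformers LLaVA format, a plain image-to-text pipeline should work; the prompt template and image path here are assumptions, so check the model card:

```python
from transformers import pipeline

pipe = pipeline("image-to-text", model="bczhou/tiny-llava-v1-hf")

# LLaVA-style prompt with an <image> placeholder (assumed; see the model card).
prompt = "USER: <image>\nWhat is shown in this picture?\nASSISTANT:"
out = pipe("photo.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 64})
print(out[0]["generated_text"])
```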
UForm-Gen-Chat
Apache-2.0
UForm-Gen-Chat is the fine-tuned multimodal conversational version of UForm-Gen, primarily used for image caption generation and visual question answering tasks.
Image-to-Text
Transformers English

unum-cloud · Downloads: 65 · Likes: 19
UForm-Gen
Apache-2.0
UForm-Gen is a small generative vision-language model primarily used for image caption generation and visual question answering.
Image-to-Text
Transformers English

unum-cloud · Downloads: 152 · Likes: 44